How to Capture the Zeitgeist via Eigen-tweets?
A Short Answer Arguably, the aforementioned enormous daily load of tweets largely encapsulates the worldwide emerging trends and breaking news. Therefore, in the past few years, tweet trend-clustering and geo-temporal trend tracking (a.k.a. zeitgeist tracking) has attracted considerable interest, with a plethora of online applications B. Parr (2009, Apr. 29). "Swine flu's tweet tweet causes online flutter," ''Business Standard'' [Online]. Available: http://mashabble.com/2009/04/04/twitter-trends performing data (tweet message) and metadata (tweet author, location, number of re-tweets, etc.) search queries on the open tweet database. Modern popular techniques are either (i) singling out the most popular tweets or (ii) delivering in single words (or hashtags) the pivotal points of emerging trends and breaking news. Although existing approaches seem to largely capture the ``spirit of the time", they suffer from some major drawbacks. On the one hand, the ``popular tweets" approach fails to accomplish ''complete'' trend identification, since the most popular tweets may span only a small subset of the emerging trends, while, on the other hand, single-word trend titling lacks in concept ''descriptiveness''. Following motivational Example 1 highlights these drawbacks. Example 1 In Table I, lies a pool of 8 tweets, available for processing. We consider these tweets to have been posted simultaneously in [http://en.wikipedia.org/wiki/Greece Greece] and to be, other than geographically, metadata-wise independent. In the second column, as a measure of popularity, we append the aggregate number of re-tweets, favors, and replies. Say that we are interested in extracting 3 objects (let them be words, phrases or tweets) that encapsulate the major emerging trends and/or breaking news and are homogeneous; i.e., an object cannot contain information for more than one trend or piece of news. Certainly, the first popular tweets method would return items 1, 4, and 8, capturing only tweets referring to the recent financial developments in [https://en.wikipedia.org/wiki/Cyprus Cyprus] . On the other hand, the single-word trend titling, in its simplest variant, would return the words ``Greek", ``Cyprus", and ``bailout", failing to offer a comprehensive depiction of the trends. State-of-the-art research on large-scale [http://en.wikipedia.org/wiki/Sparse_PCA sparse principal component analysis (PCA)] D. S. Papailiopoulos, A. G. Dimakis, S. Korokythakis, "Sparse PCA through low-rank approximations," ''arXiv:1303.055lvl [stat.ML]'', Mar. 2013. introduces the concept of Eigen-tweet (ET) basis. An ET is a tweet-like collection of key-words put together to comprehensively describe a piece of news or a trend. Respectively, ET basis is a set of ET's that are content-wise orthogonal to each other and span/capture the most important part of the zeitgeist similarly to the way the eigen-basis spans the range of a matrix. Although based on a simple concept, the Eigen-tweet basis extraction can be quite a challenging task (or even intractable) in terms of complexity, considering that meaningful results can only be derived operating on enormous tweet data sets. Interestingly, authors in , M. Asteris, D. S. Papailiopoulos, and G. N. Karystinos, ``Sparse principal component of a rank-deﬁcient matrix," in ''Proc. IEEE Intern. Symp. Inf. Theory'', St. Petersburg, Russia, Aug. 2011, pp. 673-677. have delivered novel techniques for efficient large scale sparse PCA, that, applied on huge tweet data set can deliver its principal components practically fast. In the sequel, we formally introduce the Eigen-tweets concept. A Long Answer Problem Statement Let a collection of N tweet messages t_1, t_2, \cdots comprise the available tweet data-set \mathcal T=\{ t_1, t_2, \cdots, t_N\} \ \ \ \ \ \ \ \ \ \ (1) with cardinality |\mathcal T|=N . Following the bag-of-words model W. Cohen, ``Bag of words and word positions," in ''Advances in inductive logic programming'', Amsterdam, NL: IOS Press, 1995, pp. 124--143., we build our dictionary \mathcal D=\{w_1, w_2, \cdots, w_M\} \ \ \ \ \ \ \ \ \ \ (2) as a selection of M non-trivial (pronouns, proverbs, pro-adverbs, pro-adjectives, etc. are considered to be trivial) preferably grammatically uncorrelated words appearing in \mathcal T . Certainly, for the problem at hand, we operate in the intersection of high-dimensionality and big-data regimes and N and M are large numbers, at maybe in the order of tens of thousands. Then, we build the boolean mapping matrix \mathbf S \in \{0,1\}^{M \times N} such that [\mathbf S]_{i,j}=1 \Leftrightarrow w_i appears in x_j. \ \ \ \ \ \ \ \ \ \ (3) Importantly, since each tweet constitutes of 140 characters, with the English words being 4.5 characters long on average R. J. Pierce, ''An introduction to information theory: Symbols, signals and noise'', 2nd ed. New York, NY: Dover Publications, 1980., the columns of \mathbf S will be sparse. Next, we define the M \times M nonnegative word-correlation matrix \mathbf A_w ~~ \overset{\triangle}{=} ~~ \mathbf S \mathbf S^T\ \ \ \ \ \ \ \ \ \ (4) so that [\mathbf A_w]_{i,j} \leq \min_{i,j} \| [\mathbf S]_{i,:}^T \|_1 counts the number of tweets in \mathcal T in which w_i and w_j coexist. In a large-scale weighted graph interpratation, if each word in the dictionary is represented by a vertix and two vertices are connected with an edge weighted by the number of tweets in which these two words coexist, then \mathbf A_w is the graph's (weighted) adjacency matrix. Notice that the N \times N tweet-correlation matrix \mathbf A_t \overset{\triangle}{=} \mathbf S^T\mathbf S is such that [\mathbf A_t]_{i,j} counts the number of common words in t_i and t_j . Vanilla PCA Eigen-tweets Simple principal eigen-tweet extraction could be carried out by means of ``explained variance" maximization \mathcal P: ~~~~ \begin{array}{l} \underset{\mathbf x\in \mathbb{R}^{M \times M}}{\mathrm{maximize}} ~~\mathbf x^T \mathbf A \mathbf x \\ \mathrm{subject~ to} ~\| \mathbf x\|_2 =1 \end{array} \ \ \ \ \ \ \ \ \ \ (5) also known as the ``Vanilla PCA" problem. Then, denoting by \mathcal I^* the set of the indices of the positive entries of \mathbf x^* , |\mathcal I^*| = \| \mathbf x^* \|_0 , with \| \cdot \|_0 denoting the l_0 -norm of a vector --counting the number of non-zero entries, we form the eigen-tweet t^* = \bigcup_{i \in \mathcal I^*} w_i. \ \ \ \ \ \ \ \ \ \ (6) The, for now intuitive, rationale behind this eigen-tweet construction is that [\mathbf x^*]_i will have a great value if is w_i popular and, therefore \|[\mathbf A]_{:,i}\|_1 large. Certainly, since there is no guarantee that \mathbf x^* is sparse, t^* may be too long and hard to interpret --hence, inappropriate for an eigen-tweet. This simple observation motivates the pursue of a '''sparse''' principal component of the adjacency matrix \mathbf A_w . Sparse PCA Eigen-tweets An alternative problem formulation for the derivation of the principal eigen-tweet, conforming with the risen sparsity demand, is \mathcal P_s: ~~~~ \begin{array}{l} \underset{\mathbf x\in \mathbb{R}^{M \times M}}{\mathrm{maximize}} ~~\mathbf x^T \mathbf A \mathbf x \\ \mathrm{subject~ to} ~\| \mathbf x\|_2 =1~~ \mathrm{ and }~~ \| \mathbf x\|_0 = k \end{array} \ \ \ \ \ \ \ \ \ \ (7) where k is the number of words comprising the eigen-tweet and \mathcal P_s is the well-known sparse principal-component analysis problem A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet, ``A direct formulation of sparse PCA using semideﬁnite programming," ''SIAM Review'', 49(3), 2007., A. A. Amini and M. Wainwright, ``High-dimensional analysis of semideﬁnite relaxations for sparse principal components," ''The Annals of Statistics'', 37(5B):2877-2921, 2009., B. Moghaddam, Y. Weiss, and S. Avidan, ``Generalized spectral bounds for sparse lda," in ''Proceedings of the 23rd international conference on Machine learning'', pp. 641-648, 2006. , B. Moghaddam, Y. Weiss, and S. Avidan, ``Spectral bounds for sparse PCA: exact and greedy algorithms," ''Advances in Neural Information Processing Systems'', 2006, p. 18., A. d’Aspremont, F. Bach, and L. El Ghaoui ``Optimal solutions for sparse principal component analysis," ''Journal of Machine Learning Research'', 2008, pp. 1269-1294., H. Zou, T. Hastie, and R. Tibshirani, ``Sparse principal component analysis," ''Journal of Computational and Graphical Statistics'', 2006, pp. 265-286., H. Shen and J. Z. Huang , ``Sparse principal component analysis via regularized low rank matrix approximation," ''J. Multivar. Anal.'', July 2008, pp. 1015-1034.. The l_0 cardinality constraint limits the optimization over vectors with k non-zero entries and (7) is known to be [http://en.wikipedia.org/wiki/NP-hard NP-hard] -having (\begin{array}{c} M \\ k\end{array}) candidate solutions- and, thus, computationally intractable in general. However, recent algorithmic developments on sparse PCA in show the way for tractable solution approximation to \mathcal P_s , with proven performance degradation bounds, and present the derived results in the context of zeitgeist-recording in large tweet data-sets. The state-of-the-art algorithm developed in , baring the name ''Spannogram'', is briefly presented below. Efficient Sparse PCA via ''Spannogram'' According to , the NP-hard solution to \mathcal P_{s} can be efficiently approximated -with bounded performance degradation- in the three following steps. #Approximate A_{w} by "best" rank- d approximation A_{w}^{(d)} by means of conventional singular-value decomposition (SVD) in (5). #Use A_{w}^{(d)} to obtain M^{(d)} candidate sparse principal components (Spannogram). # Browse the size M^{(d)} candidate support and make a maximum-explained-variance choice. Low-rank Approximation Observe that \mathbf \rho ~~ \overset{\triangle}{=} ~~ \mathbf rank(A_{w}) \leqslant min(M,N) and let some d of interest such that d \leqslant \rho . The problem of finding the rank- d approximation of A_w is formally \mathcal P_{LR}: ~~~~ \begin{array}{l} \underset{\mathbf D\in \mathbb{R}^{M \times M}}{\mathrm{minimize}} ~~\mathbf \| \mathbf A_{w} - D \|_F \\ \mathrm{subject~ to} ~~ rank(D)~=~d \end{array} \ \ \ \ \ \ \ \ \ \ (8) The analytical solution to \mathcal P_{LR} comes in terms of the singular value decomposition of S . Specifically, let S admit SVD \mathbf S~=~V \Sigma V^T, \ \ \ \ \ \ \ \ \ \ (9) where U^T U~=~I_\rho , V^T V~=~I_\rho and \Sigma is the positive diagonal matrix containing the singular values of S . Then, if \Lambda_{(d)} ~=~ ( \left[ \sum \right]_{\lambda_{id},1:d})^2 and U^{(d)}~=\left[ U \right]_{i,1:d} , the solution to \mathcal P_{LR} is C. Eckart and G. Young, "The approximation of one matrix by another of lower rank," ''Psychometrika'', vol. 1, pp. 211-218, 1936. A_{w}^{(d)}~=~U^{(d)} \Lambda^{(d)} U^{(d)T}. \ \ \ \ \ \ \ \ \ \ (10) The ''Spannogram'' Algorithm (d = 1,2) In the sequel, for simplicity purposes we present the ''Spannogram'' algorithm for the special cases where d=1 and d=2. Detailed on ''Spannogram'' can be found in . Rank-1 Case We observe that A_{w}^{(1)}~=~U^{(1)} \Lambda^{(1)} U^{(1)T} , with U^{(1)} ~=~ \mathbf u \in {R}^{M \times 1} , \| \mathbf u \|_2 ~ =~1 , and \Lambda^{(1)} ~=~\lambda \in {R}_{++} . Then, for any \mathbf x \in {R}^{M \times 1} , \mathbf x^{T} A_{w}^{(1)} \mathbf x ~=~ \lambda \mathbf x^{T} \mathbf u \mathbf u^{T} \mathbf x ~=~ \lambda (\mathbf u^{T} \mathbf x)^2. \ \ \ \ \ \ \ \ \ \ (11) The solution to \mathcal P_{LR}: ~~~~ \begin{array}{l} \underset{\mathbf x \in \mathcal{Z} _{k}}{\mathrm{maximize}} ~~\mathbf \lambda (\mathbf u^{T} \mathbf x)^2 \end{array} \ \ \ \ \ \ \ \ \ \ (12) where \mathcal{Z}_{k} ~=~ \left \{ \mathbf x \in {R}^{M \times 1} : \| \mathbf x \|_2 = 1, \|\mathbf x\|_0 = k \right \} , is simply \mathbf x_{opt} = \frac {z} {\|z\|_2} , where \left [z \right ] = \left\{\begin{matrix} \left [ \mathbf u \right ]_i, & \mathrm{if}~i \in \mathcal{I}_k(\mathbf u), \mathrm{for~any} \mathbf x \in {R}^{M \times 1}, \\ 0& \mathrm{otherwise}. \end{matrix}\right. \mathcal{I}_k(\mathbf x) is the set containing the indices of the k largest (in the absolute sense) entries of \mathbf x . Rank-2 Case We observe that A_{w}^{(2)}~=~U^{(2)} \Lambda^{(2)} U^{(2)T} , with U^{(2)} ~=~ \left [ \mathbf u_1 \mathbf u_2 \right ] \in {R}^{M \times 2} , \| \mathbf u_1 \|_2 ~=~ \| \mathbf u_2 \|_2 ~ =~1 , \mathbf u_1^{T} \mathbf u_2 ~=~ 0 , and \Lambda^{(2)} ~=~\begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \in {R}_{++}^{2 \times 2} for any \mathbf x \in {R}^{M \times 1} , \mathbf x^{T} A_{w}^{(2)} \mathbf x ~=~ \| V^{T} \mathbf x \|_2^{2}. \ \ \ \ \ \ \ \ \ \ (13) Then, in view of the Cauchy-Schwarz inequality, \|V^T \mathbf x \|_2^2 ~=~ \underset{\| \mathbf c \|_2 = 1} \mathrm{max} (C^T V^T \mathbf x)^2 and \underset{\mathbf x \in \mathcal{Z}_k} \mathrm{max}~ \mathbf x^T A_w^{(2)} \mathbf x ~=~ \underset{\mathbf x \in \mathcal{Z}_k} \mathrm{max} ~\underset{C \in \mathcal{S_L}} \mathrm{max} ~ \mathbf (C^T V^T \mathbf x)^2 ~=~ \underset{C \in \mathcal{S_2}} \mathrm{max} ~\underset{ \mathbf x \in \mathcal{Z}_k} \mathrm{max}~ \mathbf (C^T V^T \mathbf x)^2 \ \ \ \ \ \ \ \ \ \ (14) where \mathcal{S}_L is the surface of an L-dimensional hyper-sphere (that is, \mathcal{S}_L ~=~ \left \{ \mathbf x \in {R}^{L} : \|\mathbf x \|_2 = 1 \right \} ). Essentialy, for any point on the surface of a (2-dimensional) sphere, there is a unique angle \phi \in ( - \pi /2 , \pi /2 ] \mathbf c ~=~ \begin{bmatrix} sin \phi \\ cos \phi \end{bmatrix}. \ \ \ \ \ \ \ \ \ \ (15) Therefore, \underset{\mathbf x \in \mathcal{Z}_k} \mathrm{max}~ \mathbf x^T A_w^{(2)} \mathbf x ~=~ \underset{\phi \in ( - \pi /2 , \pi /2 ]} \mathrm{max} ~\underset{ \mathbf x \in \mathcal{Z}_k} \mathrm{max} (\mathbf v(\phi)^T \mathbf x)^2 \ \ \ \ \ \ \ \ \ \ (16) where \mathbf v(\phi) = sin(\phi) \sqrt{\lambda_2} \mathbf u_2 . Interestingly, for any fixed \phi , the solution to \underset{\mathbf x \in \mathcal{Z}_k} \mathrm{maximize}~ (\mathbf V(\phi)^T \mathbf x)^2, \ \ \ \ \ \ \ \ \ \ (17) is known by the aferomentioned solution analysis of the rank-1 case (see 14). Therefore, \underset{\mathbf x \in \mathcal{Z}_k} \mathrm{arg~max}~ \mathbf x^T A_w^{(2)} \mathbf x ~=~ \mathbf x_{opt}(\mathbf \phi_{opt}) \ \ \ \ \ \ \ \ \ \ (18) where \phi_{opt} = \underset{\phi \in ( - \pi /2 , \pi /2 ]} \mathrm{arg~max} \mathbf x(\phi)_{opt}^T A_w^{(2)} \mathbf x_{opt} and \mathbf x(\phi)_{opt} is the solution to the problem in (17) for any fixed \phi \in ( - \pi /2 , \pi /2 ] . At first sight, it seems like one should exhaustively browse uncountably infinite angles to find (18). However, quite importantly, it is proven that one suffices to look for \phi_{opt} in an easily-found size- (4 \binom {M} {2}) {/math} set of angles that correspond to different \mathcal{I}_{k}(\mathbf v(\phi)) . Therefore, instead of \binom {M}{k} , (4 \binom {M} {2}) = O(M^2) {/math} candidate solutions to these one, in fact, \underset{\mathbf x \in \mathcal{Z}_k} \mathrm{maximize}~ \mathbf x^T A_w^{(2)} \mathbf x. For more information on sparse PCA, we refer the interested reader to References